
    Runtime-guided management of stacked DRAM memories in task parallel programs

    Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited capacity is insufficient for modern HPC systems. For this reason, both stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without any complex hardware support or modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization that parallelizes the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks and can exploit this information to avoid unnecessary copies of data to the stacked DRAM. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% over the state-of-the-art library for managing the stacked DRAM and 29% over a stacked DRAM architected as a hardware cache.
    This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union's Horizon 2020 research and innovation programme (grant agreement 779877). M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104.
    Peer Reviewed. Postprint (author's final draft)
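    The reuse-aware copy avoidance described above can be sketched as a simple cost model: stage data into stacked DRAM only when the bandwidth saved over the data's expected reuses outweighs the one-time copy cost. This is a minimal illustrative sketch; the bandwidth figure, speedup ratio, and function names are invented for illustration and are not taken from the paper.

```python
# Hypothetical sketch of a reuse-aware copy decision: the runtime copies a
# task's data to stacked DRAM only when the expected reuse justifies the
# copy cost. All constants and names below are illustrative assumptions.

OFFCHIP_BANDWIDTH_GBPS = 90.0  # assumed off-chip memory bandwidth
FAST_SPEEDUP = 4.0             # assumed bandwidth advantage of stacked DRAM

def copy_is_worthwhile(bytes_accessed, expected_reuses):
    """Copy only if the time saved across all reuses exceeds the copy cost."""
    copy_cost = bytes_accessed / OFFCHIP_BANDWIDTH_GBPS
    # Each reuse reads the data once; from off-chip memory that takes
    # slow_time, while stacked DRAM serves it FAST_SPEEDUP times faster.
    slow_time = expected_reuses * bytes_accessed / OFFCHIP_BANDWIDTH_GBPS
    saved = slow_time * (1 - 1 / FAST_SPEEDUP)
    return saved > copy_cost

# Data reused by many future tasks is worth staging; single-use data is not.
print(copy_is_worthwhile(1 << 20, expected_reuses=1))  # False
print(copy_is_worthwhile(1 << 20, expected_reuses=8))  # True
```

    With this model, a copy pays off once a task's data is expected to be read roughly twice or more, which matches the intuition in the abstract that single-use data should not be staged.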

    Improving throughput of simultaneous multithreaded (SMT) processors using shareable resource signatures and hardware thread priorities

    In this dissertation we present a methodology for predicting the best priority pair for a given co-schedule of two application threads. Our approach exploits resource-utilization information that is collected during an application thread's execution in single-threaded mode. This information provides insights about the availability of resources that are shared by threads concurrently executed in simultaneous multithreading (SMT) mode for use by another co-scheduled application thread. The main contributions of this dissertation are: (1) Demonstration of the efficacy of using non-default hardware thread priority pairs to improve SMT core throughput: Using a POWER5 simulator, we show that equal (default) priorities are not the best for 82% of the 263 application trace-pairs studied. (2) The concept of a Shareable Resource Signature: this signature characterizes an application's utilization of critical shareable SMT core resources during a specified execution time interval when executed in single-threaded mode. (3) A best priority pair prediction methodology: Given shareable resource signatures of an application-thread pair, we present a methodology to predict the best priority pair for the application-thread pair when co-scheduled to run in SMT mode. (4) An implementation and validation of the methodology for the IBM POWER5 processor, which shows the following: (a) 17 of 10,000 possible signatures are sufficient to characterize 95.6% of the execution times of a set of applications that consists of 20 SPEC CPU2006 benchmarks (1 data input), three NAS NPB benchmarks (3 data inputs), and 10 PETSc KSP solvers (12 data inputs). The cgs and lsqr PETSc KSP solvers have signatures that are independent of input data, while one of the three NAS NPB benchmarks (bt-mz) has a signature that is independent of the input data.
(b) For 21 co-schedules of applications, each with a signature that characterizes 95% of its execution time, our validation study shows the following: (i) Predicted best priorities yield higher throughput than default priorities for all but one of the 21 co-schedules. Initial results showed that two co-schedules, (462.libquantum, 437.leslie3d) and (bt-mz.A, lu-mz.A), experience a throughput loss of 7.46% and 20.05%, respectively, at predicted priorities, as compared to that achieved at default priorities. Further investigation shows that for the co-schedule (bt-mz.A, lu-mz.A), mapping and executing the co-schedule with the predicted best priorities on hardware threads (5, 4), instead of (4, 5), results in a 3.56% higher throughput as compared to default priorities; this is in contrast to the 20.05% throughput loss experienced when executed on hardware threads (4, 5). Although we have not verified it, one possible reason for this is that the processor core favors one hardware thread over the other. Re-executing the co-schedule (462.libquantum, 437.leslie3d) on hardware threads (5, 4), instead of (4, 5), results in predicted priorities yielding lower throughput than the default priorities. Thus, we claim that predicted best priorities yield equal or higher throughput than default priorities for 20 of the 21 co-schedules studied, and for the outlier the throughput loss is 7.46%. (ii) Using non-default priorities improves throughput. The default priority pair yields the best throughput for only six of the 21 co-schedules. For the remaining 15, the default priority pair yields throughput that is between 0.74% and 14.10% lower than that achieved with the best priority pair. (iii) Using the predicted best priority pair, rather than default priorities, improves throughput or at least provides throughput equal to that achieved with default priorities. For 11 of the 21 co-schedules, both the default and predicted priorities yield equal throughput.
For nine of the 21, the predicted priorities yield throughput that is between 0.59% and 16.42% higher than that achieved with default priorities. For two of these nine co-schedules, the predicted priority pair yields a throughput improvement of less than 5%. Furthermore, for three, the throughput improvement associated with executing with the predicted priority pair, rather than default priorities, is between 5% and 10%, and for the other four the improvement is greater than 10%. (iv) Using predicted best priority pairs appears to be most applicable to floating-point intensive applications: For eight co-schedules comprising applications for which the utilization of the floating-point unit exceeds that of the fixed-point unit by 10% or more, the predicted priority pairs, as compared to the default priorities, yield a throughput improvement between 3.56% and 16.42%. This result indicates that the methodology for predicting best priority pairs is most applicable to applications for which floating-point unit utilization dominates that of the fixed-point unit by at least 10%. (Abstract shortened by UMI.)
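The prediction scheme described above can be sketched as a two-step table lookup: an application's single-threaded run is summarized as a coarse "shareable resource signature" (binned utilization of shared SMT resources), and a table built offline maps a signature pair to the hardware thread priority pair predicted to maximize throughput. This is an illustrative sketch only; the bin edges, resource names, and table entries below are hypothetical and not the dissertation's actual data.

```python
# Illustrative sketch of signature-based priority-pair prediction.
# Bin edges, resource names, and table contents are invented assumptions.

def signature(utilization):
    """Bin each shared-resource utilization (0.0..1.0) into low/med/high."""
    def bin_(u):
        return "low" if u < 0.3 else ("med" if u < 0.7 else "high")
    return tuple((res, bin_(u)) for res, u in sorted(utilization.items()))

DEFAULT = (4, 4)  # POWER5 default: equal priorities for both threads

# Hypothetical offline-trained table:
# (signature_a, signature_b) -> (priority_a, priority_b)
BEST_PRIORITY = {
    # An FP-heavy thread co-scheduled with a load/store-heavy thread:
    # boost the FP thread (entry invented for illustration).
    ((("fpu", "high"), ("lsu", "low")),
     (("fpu", "low"), ("lsu", "high"))): (5, 4),
}

def predict_priorities(sig_a, sig_b):
    """Fall back to the default priority pair for unseen signature pairs."""
    return BEST_PRIORITY.get((sig_a, sig_b), DEFAULT)

a = signature({"fpu": 0.9, "lsu": 0.1})
b = signature({"fpu": 0.2, "lsu": 0.8})
print(predict_priorities(a, b))  # (5, 4)
print(predict_priorities(a, a))  # (4, 4) -- default for an unseen pair
```

The fallback to default priorities mirrors the dissertation's safety property: when no prediction is available, the co-schedule simply runs at the priorities it would have used anyway.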

    PIR: PMaC’s Idiom Recognizer

    The speed of the memory subsystem often constrains the performance of large-scale parallel applications. Experts tune such applications to use hierarchical memory subsystems efficiently. Hardware accelerators, such as GPUs, can potentially improve memory performance beyond the capabilities of traditional hierarchical systems. However, the addition of such specialized hardware complicates code porting and tuning. During porting and tuning, expert application engineers manually browse source code and identify memory access patterns that are candidates for optimization and tuning. HPC applications typically contain thousands to hundreds of thousands of lines of code, creating a labor-intensive challenge for the expert. PIR, PMaC's Static Idiom Recognizer, automates the pattern recognition process. PIR recognizes specified patterns and tags the source code where they appear using static analysis. This paper describes the PIR implementation and defines a subset of idioms commonly found in HPC applications. We examine the effectiveness of the tool, demonstrating 95% identification accuracy, and present the results of using PIR on two HPC applications. Keywords: automation; performance; static analysis; tuning
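    The tag-where-it-appears workflow can be illustrated with a toy matcher: scan source lines for a textual signature of one idiom common in HPC codes, the "gather" (an indirect load such as `a[b[i]]`), and report the lines where it occurs. PIR itself performs static analysis on compiler representations rather than regex matching on text, so this sketch only illustrates the tagging output, not PIR's actual technique.

```python
import re

# Toy idiom tagger: flag lines containing a gather-style indirect index,
# e.g. x[idx[i]]. The regex and idiom choice are illustrative assumptions.
GATHER = re.compile(r"\w+\s*\[\s*\w+\s*\[")

def tag_idioms(source):
    """Return (line_number, stripped_line) pairs where the idiom appears."""
    return [(n, line.strip())
            for n, line in enumerate(source.splitlines(), start=1)
            if GATHER.search(line)]

code = """\
for (i = 0; i < n; i++)
    y[i] = x[idx[i]];   /* gather: tagged */
for (i = 0; i < n; i++)
    z[i] = x[i] + 1.0;  /* stream: not tagged */
"""
for n, line in tag_idioms(code):
    print(n, line)  # only line 2 is reported
```

    A real recognizer would work on a normalized program representation to avoid false matches in comments or strings, which is one reason static analysis is preferable to text search at HPC code scale.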